loss change
Adversarial Weight Perturbation Helps Robust Generalization
The study on improving the robustness of deep neural networks against adversarial examples grows rapidly in recent years. Among them, adversarial training is the most promising one, which flattens the \textit{input loss landscape} (loss change with respect to input) via training on adversarially perturbed examples. However, how the widely used \textit{weight loss landscape} (loss change with respect to weight) performs in adversarial training is rarely explored. In this paper, we investigate the weight loss landscape from a new perspective, and identify a clear correlation between the flatness of weight loss landscape and robust generalization gap. Several well-recognized adversarial training improvements, such as early stopping, designing new objective functions, or leveraging unlabeled data, all implicitly flatten the weight loss landscape. Based on these observations, we propose a simple yet effective \textit{Adversarial Weight Perturbation (AWP)} to explicitly regularize the flatness of weight loss landscape, forming a \textit{double-perturbation} mechanism in the adversarial training framework that adversarially perturbs both inputs and weights. Extensive experiments demonstrate that AWP indeed brings flatter weight loss landscape and can be easily incorporated into various existing adversarial training methods to further boost their adversarial robustness.
MEGG: Replay via Maximally Extreme GGscore in Incremental Learning for Neural Recommendation Models
Shi, Yunxiao, Yang, Shuo, Zhang, Haimin, Wang, Li, Wang, Yongze, Wu, Qiang, Xu, Min
Recommender systems are widely used across a broad range of applications, with recommendation algorithms serving as their core. Among the myriad of algorithmic paradigms, recommendation models based on deep neural networks (commonly referred to as Neural Collaborative Filtering, or NCF [1]) have garnered significant traction within the industry due to their implementation simplicity and high efficiency in delivering effective results [1-8]. Traditionally, these recommendation algorithms follow the conventional deep learning paradigm, where models are trained on fixed datasets and then applied to unseen data under the assumption of a static data distribution. However, in many real-world applications, such as music streaming [9], news recommendation [10], Point-Of-Interest (POI) recommendation [11], movie recommendation [12], and e-commerce platforms [13], recommender systems operate in dynamic environments where user interaction data stream is continuously generated [14-16], reflecting the evolving nature of users' preferences. This implies that incoming streaming data, which has not been observed during training, may differ significantly from the original training data in terms of distribution. As a result, models previously trained in static environments, when deployed under dynamic conditions for extended periods, often experience a decline in predictive performance [17].
Hidden Breakthroughs in Language Model Training
Kangaslahti, Sara, Rosenfeld, Elan, Saphra, Naomi
Loss curves are smooth during most of model training, so visible discontinuities stand out as possible conceptual breakthroughs. Studying these breakthroughs enables a deeper understanding of learning dynamics, but only when they are properly identified. This paper argues that similar breakthroughs occur frequently throughout training but they are obscured by a loss metric that collapses all variation into a single scalar. To find these hidden transitions, we introduce POLCA, a method for decomposing changes in loss along arbitrary bases of the low-rank training subspace. We use our method to identify clusters of samples that share similar changes in loss during training, disaggregating the overall loss into that of smaller groups of conceptually similar data. We validate our method on synthetic arithmetic and natural language tasks, showing that POLCA recovers clusters that represent interpretable breakthroughs in the model's capabilities. We demonstrate the promise of these hidden phase transitions as a tool for unsupervised interpretability.
Towards Robust Influence Functions with Flat Validation Minima
Ye, Xichen, Wu, Yifan, Zhang, Weizhong, Jin, Cheng, Chen, Yifan
The Influence Function (IF) is a widely used technique for assessing the impact of individual training samples on model predictions. However, existing IF methods often fail to provide reliable influence estimates in deep neural networks, particularly when applied to noisy training data. This issue does not stem from inaccuracies in parameter change estimation, which has been the primary focus of prior research, but rather from deficiencies in loss change estimation, specifically due to the sharpness of validation risk. In this work, we establish a theoretical connection between influence estimation error, validation set risk, and its sharpness, underscoring the importance of flat validation minima for accurate influence estimation. Furthermore, we introduce a novel estimation form of Influence Function specifically designed for flat validation minima. Experimental results across various tasks validate the superiority of our approach.
CRPO: Confidence-Reward Driven Preference Optimization for Machine Translation
Cui, Guofeng, Wang, Pichao, Liu, Yang, Ke, Zemian, Liu, Zhu, Bhat, Vimal
Large language models (LLMs) have shown great potential in natural language processing tasks, but their application to machine translation (MT) remains challenging due to pretraining on English-centric data and the complexity of reinforcement learning from human feedback (RLHF). Direct Preference Optimization (DPO) has emerged as a simpler and more efficient alternative, but its performance depends heavily on the quality of preference data. To address this, we propose Confidence-Reward driven Preference Optimization (CRPO), a novel method that combines reward scores with model confidence to improve data selection for fine-tuning. CRPO selects challenging sentence pairs where the model is uncertain or underperforms, leading to more effective learning. While primarily designed for LLMs, CRPO also generalizes to encoder-decoder models like NLLB, demonstrating its versatility. Empirical results show that CRPO outperforms existing methods such as RS-DPO, RSO and MBR score in both translation accuracy and data efficiency.
Transformer verbatim in-context retrieval across time and scale
Armeni, Kristijan, Pranjiฤ, Marko, Pollak, Senja
To predict upcoming text, language models must in some cases retrieve in-context information verbatim. In this report, we investigated how the ability of language models to retrieve arbitrary in-context nouns developed during training (across time) and as language models trained on the same dataset increase in size (across scale). We then asked whether learning of in-context retrieval correlates with learning of more challenging zero-shot benchmarks. Furthermore, inspired by semantic effects in human short-term memory, we evaluated the retrieval with respect to a major semantic component of target nouns, namely whether they denote a concrete or abstract entity, as rated by humans. We show that verbatim in-context retrieval developed in a sudden transition early in the training process, after about 1% of the training tokens. This was observed across model sizes (from 14M and up to 12B parameters), and the transition occurred slightly later for the two smallest models. We further found that the development of verbatim in-context retrieval is positively correlated with the learning of zero-shot benchmarks. Around the transition point, all models showed the advantage of retrieving concrete nouns as opposed to abstract nouns. In all but two smallest models, the advantage dissipated away toward the end of training.
Adversarial Weight Perturbation Helps Robust Generalization
The study on improving the robustness of deep neural networks against adversarial examples grows rapidly in recent years. Among them, adversarial training is the most promising one, which flattens the \textit{input loss landscape} (loss change with respect to input) via training on adversarially perturbed examples. However, how the widely used \textit{weight loss landscape} (loss change with respect to weight) performs in adversarial training is rarely explored. In this paper, we investigate the weight loss landscape from a new perspective, and identify a clear correlation between the flatness of weight loss landscape and robust generalization gap. Several well-recognized adversarial training improvements, such as early stopping, designing new objective functions, or leveraging unlabeled data, all implicitly flatten the weight loss landscape. Based on these observations, we propose a simple yet effective \textit{Adversarial Weight Perturbation (AWP)} to explicitly regularize the flatness of weight loss landscape, forming a \textit{double-perturbation} mechanism in the adversarial training framework that adversarially perturbs both inputs and weights.
MT2ST: Adaptive Multi-Task to Single-Task Learning
The conventional training approaches often face challenges in balancing the breadth of multi-task learning (MTL) with the depth of single-task learning (STL). To address this issue, we introduce the Multi-Task to Single-Task (MT2ST) framework, a groundbreaking approach that can combine the generalizability of MTL with the precision of STL. Our work include two strategies: 'Diminish' and 'Switch'. 'Diminish' Strategy will gradually reduce the influence of auxiliary tasks, while the 'Switch' strategy involves a shift from multi-tasking to single-tasking at a specific timepoint at the training process. In this paper, we propose the Multi-Task to Single-Task (MT2ST) framework, a novel approach that significantly enhances the efficiency and accuracy of word embedding training while concurrently addressing prevalent issues such as overfitting. Our empirical studies demonstrate that MT2ST can reduce training time by 67\% when contrasted with single-task learning approaches, and by 13\% compared to traditional multi-task learning methods. These findings underscore MT2ST's potential to be a powerful tools for word embedding training acceleration.